Draft

Note: Initial data sets were cleaned and merged prior to this analysis. The original data sets’ structure and cleaning steps may be found in the file us_state_death_trends_wrangling.

Introduction and Objectives

This report explores the Centers for Disease Control and Prevention’s (CDC) Weekly Morbidity and Mortality data from 2014 through the 2nd quarter of 2021. The analysis focuses on national-level weekly death counts for specific causes in the United States. A second report will analyze specific state and local regions. Reporting is voluntary and is not guaranteed to be timely or regular. Therefore, the most recent quarter of data may be incomplete and unreliable for analysis and insights.

More information on the original data sets can be found at CDC’s website using the following links:

Note: The most current data can be downloaded using the links above. To use the data sets without any code modifications, the user will need to:

  1. Place the two data sets in a subdirectory named “data”, and
  2. Rename the data sets as “weekly_2014_2019” and “weekly_2020_2021”.
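
The two setup steps above can be checked before running any analysis code. The following is a minimal sketch only: `expected_data_files` is a hypothetical helper, and the ".csv" extension is an assumption, since this report gives only the base file names.

```r
# Sketch: confirm the two renamed data sets are where the code expects them.
# ASSUMPTION: the ".csv" extension; the report states only the base names.
expected_data_files <- function(dir = "data", ext = ".csv") {
  file.path(dir, paste0(c("weekly_2014_2019", "weekly_2020_2021"), ext))
}

paths <- expected_data_files()
missing_paths <- paths[!file.exists(paths)]

# Report any file that is not in place instead of failing later on read.
if (length(missing_paths) > 0) {
  message("Missing data files: ", paste(missing_paths, collapse = ", "))
}
```
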
# Create United States subset
us_deaths_df <- mmwr_1421_df[Location == "United States", ]

# Dropping unused levels (only United States occurs). 
us_deaths_df$Location <- droplevels(us_deaths_df$Location)

Data Structure

Observations and features

Verify United States data set is smooth at merge (2019 to 2020).

Note: State level data will be verified in individual State analysis.

melt(us_deaths_df[ Week_End_Date > as.Date("2019-09-01") & 
                   Week_End_Date < as.Date("2020-03-31"),
                   Week_End_Date:Abnormal_Finding],
     id.vars = "Week_End_Date") %>%
  
  ggplot(aes(x=Week_End_Date, y = value, group=variable)) +
    geom_point() +
    geom_vline(xintercept=as.Date("2020-01-01"), linetype="dotted") +
    facet_wrap(~ variable, scales = 'free_y') + 
    theme_bw()

There are vertical gaps (jumps) seen at the intersection of the two data sets for:

  - Influenza and Pneumonia
  - Other Respiratory
  - Abnormal Finding

After the jumps, the data settles into a pattern consistent with the trend of the previous years’ data. These gaps are likely related to Covid-19 cases that went undiagnosed because Covid-19 testing was not available until March of 2020 and was not widely available (without restricted use) until May of 2020. In addition, no Covid-19 cause-of-death code was available in the United States on January 1st of 2020, so any such deaths would have been recorded under another general category such as these three.

In addition to the gaps, there are notable peaks at the merge point (date) of the two data sets. The data is generally smooth before and after these peaks, indicating a local anomaly within an otherwise consistent long-term pattern.

The analysis will consist mostly of averages and year-to-year comparisons of descriptive statistics. Therefore, gaps and trends lasting 1-2 weeks will not affect the analysis.
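
One way to put a number on the jumps described above is to compare the average of the last few weeks before the merge date with the average of the first few weeks after it. The sketch below is illustrative only: `merge_jump` is a hypothetical helper that is not part of the original analysis, the 4-week window is an arbitrary assumption, and rows are assumed to be sorted by `Week_End_Date`.

```r
# Sketch: size of the step at the merge date, per cause.
# ASSUMPTIONS: rows are sorted by Week_End_Date; the 4-week window is arbitrary.
merge_jump <- function(dt, cols,
                       merge_date = as.Date("2020-01-01"),
                       n_weeks = 4) {
  dates  <- dt[["Week_End_Date"]]
  before <- tail(which(dates < merge_date), n_weeks)   # last weeks of first set
  after  <- head(which(dates >= merge_date), n_weeks)  # first weeks of second set

  mean_before <- vapply(cols, function(col) mean(dt[[col]][before]), numeric(1))
  mean_after  <- vapply(cols, function(col) mean(dt[[col]][after]), numeric(1))

  data.frame(cause = cols,
             mean_before = mean_before,
             mean_after = mean_after,
             jump = mean_after - mean_before,
             row.names = NULL)
}

# Usage, only if the report's data object is available in the session.
if (exists("us_deaths_df")) {
  print(merge_jump(us_deaths_df,
                   c("Influenza_Pneumonia", "Other_Respiratory",
                     "Abnormal_Finding")))
}
```

A jump that is small relative to the yearly averages would support the statement that 1-2 week effects do not change the conclusions.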

First three observations (chronologically) of the data set.

head(us_deaths_df, 3)

Last three observations (chronologically) of the data set.

tail(us_deaths_df, 3)
str(us_deaths_df)
## Classes 'data.table' and 'data.frame':   398 obs. of  18 variables:
##  $ Location           : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year               : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Week               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Week_End_Date      : Date, format: "2014-01-04" "2014-01-11" ...
##  $ Natural            : int  50189 52450 51043 50560 50402 49790 50175 49010 47907 48353 ...
##  $ Heart              : int  13166 13663 12928 12813 12896 12681 12984 12577 12248 12318 ...
##  $ Cancer             : int  11244 11504 11496 11629 11584 11355 11477 11478 11251 11535 ...
##  $ Lower_Respiratory  : int  3331 3444 3333 3467 3283 3351 3303 3047 3008 3043 ...
##  $ Brain              : int  2669 2738 2714 2720 2699 2684 2669 2799 2630 2529 ...
##  $ Alzheimer          : int  1780 1917 1914 1862 1867 1873 1843 1814 1776 1830 ...
##  $ Diabetes           : int  1654 1735 1660 1602 1586 1643 1642 1564 1588 1536 ...
##  $ Covid_19_Multi     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Covid_19           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Influenza_Pneumonia: int  1639 1910 1920 1765 1642 1528 1472 1269 1228 1215 ...
##  $ Kidney             : int  965 1098 1056 1029 998 1038 1021 973 1018 1040 ...
##  $ Other_Respiratory  : int  756 845 812 753 720 728 739 731 687 760 ...
##  $ Septicemia         : int  882 905 919 845 890 849 851 708 779 777 ...
##  $ Abnormal_Finding   : int  679 665 598 622 664 641 638 643 595 642 ...
##  - attr(*, ".internal.selfref")=<externalptr>

After preprocessing and cleaning, the United States (U.S.) subset used in this analysis contains 398 observations and 18 features, which include categorical location data, chronological date and week-of-year information, and integer weekly death counts by cause. The earliest week is the 1st week of January 2014. The most current week as of this analysis is the week ending August 14, 2021. The summary statistics for the U.S. subset follow.

Summary statistics

describe(us_deaths_df[, Location:Week_End_Date])
## us_deaths_df[, Location:Week_End_Date] 
## 
##  4  Variables      398  Observations
## --------------------------------------------------------------------------------
## Location 
##             n       missing      distinct         value 
##           398             0             1 United States 
##                         
## Value      United States
## Frequency            398
## Proportion             1
## --------------------------------------------------------------------------------
## Year 
##        n  missing distinct     Info     Mean      Gmd 
##      398        0        8    0.984     2017    2.537 
## 
## lowest : 2014 2015 2016 2017 2018, highest: 2017 2018 2019 2020 2021
##                                                           
## Value       2014  2015  2016  2017  2018  2019  2020  2021
## Frequency     53    52    52    52    52    52    53    32
## Proportion 0.133 0.131 0.131 0.131 0.131 0.131 0.133 0.080
## --------------------------------------------------------------------------------
## Week 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      398        0       53        1    25.83    17.31     3.00     5.70 
##      .25      .50      .75      .90      .95 
##    13.00    25.00    38.75    47.00    50.00 
## 
## lowest :  1  2  3  4  5, highest: 49 50 51 52 53
## --------------------------------------------------------------------------------
## Week_End_Date 
##          n    missing   distinct       Info       Mean        Gmd        .05 
##        398          0        398          1 2017-10-24        931 2014-05-22 
##        .10        .25        .50        .75        .90        .95 
## 2014-10-08 2015-11-29 2017-10-24 2019-09-19 2020-11-09 2021-03-28 
## 
## lowest : 2014-01-04 2014-01-11 2014-01-18 2014-01-25 2014-02-01
## highest: 2021-07-17 2021-07-24 2021-07-31 2021-08-07 2021-08-14
## --------------------------------------------------------------------------------

There are 398 observations, beginning with the week ending 01-04-2014 and ending with the week ending 08-14-2021. The years 2014 and 2020 each have a week number 53. A calendar year is one or two days longer than 52 full weeks, so a 53rd week is periodically needed when the last week of the year is split with the following year. This 53rd week will not significantly affect the results because the analysis uses year-long and quarterly averaging. Year 2021 has only 32 weeks. Any analysis after week 32 will show 2020 to previous year comparisons. After further review of the data, analysis will be limited to the end of the 2nd quarter of 2021 (see “Examining possible reporting delays in last quarter of data” below).
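
The observation count can be cross-checked against the first and last week-ending dates reported in the `describe` output above. This is a small arithmetic sketch in base R; the variable names are illustrative.

```r
# Sketch: confirm that weekly data between the first and last week-ending
# dates in the describe() output gives the reported number of observations.
first_week <- as.Date("2014-01-04")
last_week  <- as.Date("2021-08-14")

# Days between the two dates, divided into weeks, plus the first week itself.
n_weeks <- as.numeric(last_week - first_week) / 7 + 1
n_weeks  # 398

# Same check by generating the weekly sequence explicitly.
week_ends <- seq(first_week, last_week, by = "week")
length(week_ends)  # 398
```

A whole-number result also indicates that both dates fall on the same weekday, which is what an unbroken weekly series requires.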

Cause of weekly deaths summary statistics

Note the Covid-19 data contains fill values of zero for years 2014 through 2019 (inclusive).

as.data.frame(psych::describe(us_deaths_df[ , Natural:Abnormal_Finding]))

Summary statistics for Covid features without fill values (2020-2021)

as.data.frame(psych::describe(us_deaths_df[ Year > 2019, 
                                            .(Covid_19_Multi, Covid_19)]))
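
The subset above excludes the fill zeros by filtering on `Year`. An alternative, sketched here as an assumption rather than as the method used in this report, is to mark the pre-2020 fill values as missing so that any summary computed with `na.rm = TRUE` ignores them. `mask_fill_zeros` is a hypothetical helper.

```r
# Sketch: replace pre-2020 fill zeros with NA so they do not distort summaries.
# ASSUMPTION: every value before first_year is a fill value, not a true count.
mask_fill_zeros <- function(x, year, first_year = 2020L) {
  x[year < first_year] <- NA
  x
}

# Small illustration with made-up numbers (not CDC data).
counts <- c(0L, 0L, 5L, 7L)
years  <- c(2018L, 2019L, 2020L, 2021L)

mean(counts)                                        # pulled down by fill zeros
mean(mask_fill_zeros(counts, years), na.rm = TRUE)  # mean of real counts only
```
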

Distributions 2014-2021 (Covid deaths are zero prior to 2020)

us_deaths_df[ , Natural:Abnormal_Finding]%>%
  hist()

Most of the distributions above are approximately normal, with many showing right skew. The Covid categories are affected by the zeros from years before 2020, when there was no Covid-19 diagnosis. These zeros will be removed by limiting the data to the years 2020 and later. The cancer and abnormal categories will benefit from a larger number of bins.
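
The visual impression of right skew can be backed by a number. The sketch below computes a moment-based sample skewness in base R; `sample_skewness` is a hypothetical helper, and positive values indicate a longer right tail.

```r
# Sketch: moment-based sample skewness (third central moment divided by the
# second central moment raised to the power 3/2).
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Usage, only if the report's data object is available in the session.
if (exists("us_deaths_df")) {
  print(sample_skewness(us_deaths_df[["Cancer"]]))
}
```

Values near zero are consistent with a roughly symmetric distribution; there is no universal cut-off, so the result is best read alongside the histograms.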

Distributions 2014-2021 (Covid distributions are years 2020-2021)

par(mfrow=c(2,2))

  hist(us_deaths_df[ , Cancer], breaks = 32, 
       main = "Cancer", 
       xlab = "Weekly Cancer Deaths (bins=32)") 
  hist(us_deaths_df[ Year > 2019, Covid_19_Multi], breaks = 14, 
       main = "Covid-19 Comorbidity", 
       xlab = "Weekly Covid-19 Comorbidity Deaths (bins=14)")
  hist(us_deaths_df[ Year > 2019, Covid_19], breaks = 14, 
       main = "Covid-19 Singular Cause", 
       xlab = "Weekly Covid-19 Deaths (bins=14)")
  hist(us_deaths_df[ , Abnormal_Finding], breaks = 51, 
       main = "Abnormal Finding", 
       xlab = "Weekly Abnormal Finding Deaths (bins=51)")

par(mfrow=c(1,1))
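
In `hist()`, a single number passed to `breaks` is only a suggestion, and R may choose a different number of cells so that the break points fall on rounded values. If the bin counts quoted in the axis labels need to match the plots exactly, one option is to pass an explicit vector of break points. The following is a sketch; `equal_width_breaks` is a hypothetical helper.

```r
# Sketch: explicit equal-width break points so hist() uses exactly n_bins cells.
equal_width_breaks <- function(x, n_bins) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n_bins + 1)
}

# Usage, only if the report's data object is available in the session.
if (exists("us_deaths_df")) {
  cancer <- us_deaths_df[["Cancer"]]
  hist(cancer, breaks = equal_width_breaks(cancer, 32),
       main = "Cancer",
       xlab = "Weekly Cancer Deaths (bins=32)")
}
```
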

The cancer and abnormal findings categories show normal-like data distributions, with cancer having possible outliers to the left and abnormal findings having significant outliers to the right. Outliers are expected with abnormal findings because the diagnosis is itself an outlier from all other diagnoses. Cancer deaths would be expected to be consistent, with no week differing significantly from the others. Therefore, cancer will require further analysis.

Exploring outliers in cancer data

plot(us_deaths_df$Week_End_Date, us_deaths_df$Cancer)

The outliers occur during the last 3 weeks of the data and possibly extend further. This is likely due to reporting delays and will affect all categories.

Examining possible reporting delays in last quarter of data

melt(us_deaths_df[ ,
                   Week_End_Date:Abnormal_Finding],
     id.vars = "Week_End_Date") %>%
  
  ggplot(aes(x=Week_End_Date, y = value, group=variable)) +
    geom_point() +
    geom_vline(xintercept=as.Date("2021-07-01"), linetype="solid", color = "blue") +
    facet_wrap(~ variable, scales = 'free_y') + 
    theme_bw()

All categories except abnormal findings and the Covid-19 related categories show the same pattern of outliers, likely due to delayed reporting. Some categories, such as Alzheimer’s, show possible delayed reporting since the end of the 2nd quarter (blue line at 2021-07-01). To prevent possible errors, the analysis will be limited to data through the 2nd quarter of 2021 (2021-06-30).
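
The visual check for delayed reporting can be complemented by a simple rule that compares the most recent weeks with a reference level from the preceding year. This is a sketch under stated assumptions, not the method used in this report: `flag_reporting_lag` is a hypothetical helper, and the 8-week window, 52-week reference period, and 0.9 ratio are arbitrary choices.

```r
# Sketch: flag recent weeks whose counts fall well below the median of the
# preceding reference period, which may indicate incomplete reporting.
# ASSUMPTIONS: n_recent, n_ref, and ratio are arbitrary illustrative values.
flag_reporting_lag <- function(x, n_recent = 8, n_ref = 52, ratio = 0.9) {
  n <- length(x)
  stopifnot(n >= n_recent + n_ref)

  recent_idx <- seq(n - n_recent + 1, n)
  ref_idx    <- seq(n - n_recent - n_ref + 1, n - n_recent)
  ref_median <- median(x[ref_idx])

  data.frame(index   = recent_idx,
             value   = x[recent_idx],
             flagged = x[recent_idx] < ratio * ref_median)
}

# Usage, only if the report's data object is available in the session.
if (exists("us_deaths_df")) {
  print(flag_reporting_lag(us_deaths_df[["Cancer"]]))
}
```

A rule like this suits causes with a stable weekly level, such as cancer. Strongly seasonal causes would need a seasonally adjusted reference.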

Distributions after dropping 2021 3rd quarter data

# Drop 2021 3rd quarter data before re-plotting the distributions
us_deaths_df <- us_deaths_df[Week_End_Date < as.Date("2021-07-01")]

us_deaths_df[ , Natural:Abnormal_Finding] %>%
  hist()

par(mfrow=c(2,2))

  hist(us_deaths_df[ , Cancer], breaks = 32, 
       main = "Cancer", 
       xlab = "Weekly Cancer Deaths (bins=32)") 
  hist(us_deaths_df[ Year > 2019, Covid_19_Multi], breaks = 14, 
       main = "Covid-19 Comorbidity", 
       xlab = "Weekly Covid-19 Comorbidity Deaths (bins=14)")
  hist(us_deaths_df[ Year > 2019, Covid_19], breaks = 14, 
       main = "Covid-19 Singular Cause", 
       xlab = "Weekly Covid-19 Deaths (bins=14)")
  hist(us_deaths_df[ , Abnormal_Finding], breaks = 51, 
       main = "Abnormal Finding", 
       xlab = "Weekly Abnormal Finding Deaths (bins=51)")

par(mfrow=c(1,1))

Chronological data after dropping 2021 3rd quarter data

melt(us_deaths_df[ ,
                   Week_End_Date:Abnormal_Finding],
     id.vars = "Week_End_Date") %>%
  
  ggplot(aes(x=Week_End_Date, y = value, group=variable)) +
    geom_point() +
    facet_wrap(~ variable, scales = 'free_y') + 
    theme_bw()

Seasonal Patterns Seen in Non-seasonal Causes

psych::pairs.panels(us_deaths_df[ , Natural:Abnormal_Finding], scale = TRUE)

psych::pairs.panels(us_deaths_df[ Year > 2019 , Natural:Abnormal_Finding], scale = TRUE)

psych::pairs.panels(us_deaths_df[ Year > 2019,
                                  .(Natural, Heart, Brain, Alzheimer, Diabetes, 
                                    Covid_19_Multi, Covid_19)], 
                    scale = TRUE)
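
The pair plots show how the causes move together. A complementary view of seasonality, sketched below as an illustration rather than as part of the original analysis, is the mean count for each week of the year across all years. `seasonal_profile` is a hypothetical helper.

```r
# Sketch: average count per week-of-year, pooled across years.
# A flat profile suggests no seasonality; a winter peak suggests a seasonal cause.
seasonal_profile <- function(values, week) {
  tapply(values, week, mean, na.rm = TRUE)
}

# Usage, only if the report's data object is available in the session.
if (exists("us_deaths_df")) {
  heart_profile <- seasonal_profile(us_deaths_df[["Heart"]],
                                    us_deaths_df[["Week"]])
  plot(as.integer(names(heart_profile)), heart_profile, type = "l",
       xlab = "Week of year", ylab = "Mean weekly Heart deaths")
}
```

Week 53 occurs in only two of the years, so its average rests on very few observations and should be read with caution.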
